Backport: Add a config to disable R2C Host retry (#14373) by thirtiseven · Pull Request #14409 · NVIDIA/spark-rapids

thirtiseven · 2026-03-12T10:10:04Z

Contributes to #14368

Description

This pr add a config spark.rapids.sql.rowToColumnar.retry.enabled to disable the per-row retry introduced in #13842, which caused performance issues in some cases.

Also added a test.

Checklists

This PR has added documentation for new or modified features or behaviors.
This PR has added new tests or modified existing tests to cover new code paths.
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Contributes to NVIDIA#14368 ### Description This pr add a config `spark.rapids.sql.rowToColumnar.retry.enabled` to disable the per-row retry introduced in NVIDIA#13842, which caused performance issues in some cases. Also added a test. ### Checklists - [ ] This PR has added documentation for new or modified features or behaviors. - [x] This PR has added new tests or modified existing tests to cover new code paths. (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.) - [ ] Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description. --------- Signed-off-by: Haoyang Li <haoyangl@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

thirtiseven · 2026-03-12T10:10:16Z

build

greptile-apps · 2026-03-12T10:14:04Z

Greptile Summary

This PR adds an internal config spark.rapids.sql.rowToColumnar.retry.enabled (defaulting to false) that disables the per-row retry wrapper introduced in #13842 for RowToColumnarIterator, addressing a reported performance regression. When disabled, each row is converted directly without a withRetryNoSplit/withRestoreOnRetry wrapper, removing per-row overhead at the cost of propagating host OOM exceptions directly to the task level. The builders.tryBuild(rowCount) call remains wrapped in withRetryNoSplit in both paths, so GPU OOM at the build step is still retried.

RowToColumnarIterator gains an enableRetry: Boolean = false constructor parameter that gates the two conversion branches.
Both production call sites (GpuRowToColumnarExec and ParquetCachedBatchSerializer) now read rapidsConf.isR2cRetryEnabled and forward it to the iterator.
Four new tests are added: CPU OOM without retry (propagates), CPU OOM with retry (recovers), multi-batch retry correctness, and a single-batch recovery case using a custom rowThatFailsOnceWithCpuRetryOOM helper that throws exactly once.
Existing retry tests are updated to pass enableRetry = true explicitly to preserve original behaviour coverage.

Confidence Score: 5/5

Safe to merge — the change is gated behind an internal config that defaults to the new (retry-disabled) behaviour, and all production call sites have been updated.
The implementation is logically correct: the non-retry path removes only the per-row OOM wrapper while keeping the final builders.tryBuild retry intact. Both production callers read and forward the new config. The new tests directly validate all four scenarios (retry on/off × OOM recovery/propagation). No pre-existing retry tests have been broken; they were updated to pass enableRetry = true explicitly. The config is marked .internal() so no user-facing documentation update is strictly required.
No files require special attention.

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuRowToColumnarExec.scala	Adds `enableRetry: Boolean = false` parameter to `RowToColumnarIterator`, splitting `buildBatch()` into two branches: the existing per-row retry path and a new direct-conversion path. Removed a leftover TODO comment. Logic is sound; the `builders.tryBuild(rowCount)` remains wrapped in `withRetryNoSplit` in both paths.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala	Adds `ENABLE_R2C_RETRY` internal config `spark.rapids.sql.rowToColumnar.retry.enabled` defaulting to `false`, and the corresponding `isR2cRetryEnabled` lazy accessor. Documentation in the `.doc()` call correctly describes the trade-off.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetCachedBatchSerializer.scala	Reads `enableR2cRetry` from `rapidsConf` and forwards it to `RowToColumnarIterator`. The change correctly threads the new config through the Parquet batch-serialization path.
tests/src/test/scala/com/nvidia/spark/rapids/RowToColumnarIteratorRetrySuite.scala	Updates existing tests to pass `enableRetry = true` explicitly, and adds four new tests covering CPU OOM with and without retry (single-batch and multi-batch). The `rowThatFailsOnceWithCpuRetryOOM` helper correctly uses a per-instance `var failed` so each returned row has independent throw-once semantics.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["RowToColumnarIterator.buildBatch()"] --> B{enableRetry?}

    B -- true --> C["Create RetryableRowConverter\n(with UnsafeProjection)"]
    C --> D["converter.attempt(rowIter.next())"]
    D --> E["withRetryNoSplit {\n  withRestoreOnRetry(converter) {\n    converters.convert(currentRow, builders)\n  }\n}"]
    E -- OOM --> F["converter.restore()\nRollback builders + copy row"]
    F --> E
    E -- success --> G["byteCount += bytesWritten\nrowCount += 1"]

    B -- false --> H["val row = rowIter.next()"]
    H --> I["converters.convert(row, builders)"]
    I -- OOM --> J["CpuRetryOOM propagates\n(task may fail)"]
    I -- success --> K["byteCount += bytes\nrowCount += 1"]

    G --> L{more rows\nwithin target?}
    K --> L
    L -- yes --> D
    L -- yes --> H
    L -- no --> M["withRetryNoSplit {\n  builders.tryBuild(rowCount)\n}"]
    M --> N["return ColumnarBatch"]

_{Last reviewed commit: a3f34e9}

thirtiseven self-assigned this Mar 12, 2026

sameerz added the performance A performance related task/issue label Mar 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backport: Add a config to disable R2C Host retry (#14373)#14409

Backport: Add a config to disable R2C Host retry (#14373)#14409
thirtiseven wants to merge 1 commit intoNVIDIA:release/26.02from
thirtiseven:backport_r2c_config

thirtiseven commented Mar 12, 2026

Uh oh!

thirtiseven commented Mar 12, 2026

Uh oh!

greptile-apps bot commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thirtiseven commented Mar 12, 2026

Description

Checklists

Uh oh!

thirtiseven commented Mar 12, 2026

Uh oh!

greptile-apps bot commented Mar 12, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants